{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "36df8562",
   "metadata": {},
   "source": [
    "### Compute UMAP embedding based on biomolecular structures\n",
    "A file containing the precomputed UMAP embedding based on the 20k biomolecular structures set ([biostructures_20k.csv](biostructures_20k.csv)) is provided, enabling the projection of new structures onto this embedding.\n",
    "MCES distances of all new structures to all biomolecular structures have to be provided (computation of Myopic MCES distances via [https://github.com/AlBi-HHU/molecule-comparison](https://github.com/AlBi-HHU/molecule-comparison))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "44a1cba0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "import pandas as pd\n",
    "import pickle\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f2f2cd0",
   "metadata": {},
   "source": [
    "#### Load newly computed MCES distances"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "abdc0a04",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "len(biostructures_set)=18096\n"
     ]
    }
   ],
   "source": [
    "new_distances = pd.read_csv('new_mces_distances_example.csv')\n",
    "# for biomolecular structures, do not consider outlier clusters\n",
    "outlier_indices = [int(l.strip()) for l in open('biostructures_20k_outlier_indices.txt')]\n",
    "biostructures_set = [l.strip() for i, l in enumerate(pd.read_csv('biostructures_20k.csv').smiles.tolist()) \n",
    "                     if i not in outlier_indices]\n",
    "print(f'{len(biostructures_set)=}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a54f6bb5",
   "metadata": {},
   "source": [
    "#### bring to into correct format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "dd34b04d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mces_array.shape=(10, 18096)\n"
     ]
    }
   ],
   "source": [
    "new_smiles = new_distances.smiles1.unique().tolist()\n",
    "new_distances = new_distances.set_index(['smiles1', 'smiles2'])\n",
    "mces_array = np.full((len(new_smiles), len(biostructures_set)), np.nan)\n",
    "for i, smiles1 in enumerate(new_smiles):\n",
    "    for j, smiles2 in enumerate(biostructures_set):\n",
    "        mces_array[i, j] = new_distances.loc[(smiles1, smiles2), 'mces']\n",
    "print(f'{mces_array.shape=}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac5065de",
   "metadata": {},
   "source": [
    "#### load precomputed UMAP embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df39a54b",
   "metadata": {},
   "outputs": [],
   "source": [
    "umap = pickle.load(open('umap_embedding_biostructures.pkl', 'rb'))\n",
    "umap_projected = umap.transform(mces_array)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e60d9c6d",
   "metadata": {},
   "source": [
    "#### Append new distances to `umap_df.csv` to visualize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "1dbb1fcc",
   "metadata": {},
   "outputs": [],
   "source": [
    "new_distances_df = pd.DataFrame({'umap1': umap_projected[:, 0], 'umap2': umap_projected[:, 1], \n",
    "                                 'smiles': new_smiles, \n",
    "                                 'set': ['new_distances'] * len(new_smiles)})\n",
    "umap_df = pd.read_csv('umap_df.csv')\n",
    "umap_df = pd.concat([umap_df, new_distances_df], axis=0).to_csv('umap_df.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f1d4c50",
   "metadata": {},
   "source": [
    "#### Visualization\n",
    "see [display_umap.ipynb](display_umap.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "umap_mces_mamba",
   "language": "python",
   "name": "umap_mces_mamba"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
